Learning Spatial-Frequency Transformer for Visual Object Tracking
Abstract
Recently, some researchers have begun to adopt the Transformer to combine with or replace the widely used ResNet as their new backbone network. The Transformer captures long-range relations between pixels well using the self-attention scheme, which compensates for the issues caused by the limited receptive field of CNNs. Although these trackers work well in regular scenarios, they simply flatten the 2D features into a sequence to better match the Transformer. We believe these operations ignore the spatial prior of the target object, which may lead to sub-optimal results only. In addition, many works demonstrate that self-attention is actually a low-pass filter, independent of the input features or keys/queries. That is to say, it may suppress the high-frequency component of the input features and preserve or even amplify the low-frequency information. To handle these issues, in this paper, we propose a unified Spatial-Frequency Transformer that models the Gaussian spatial Prior and High-frequency emphasis Attention (GPHA) simultaneously. To be specific, the Gaussian spatial prior is generated using dual Multi-Layer Perceptrons (MLPs) and injected into the similarity matrix produced by multiplying the Query and Key features in self-attention. The output is fed into a softmax layer and then decomposed into two components, i.e., the direct signal and the high-frequency signal. The low- and high-pass branches are rescaled and combined to achieve all-pass; therefore, the high-frequency features are protected well in stacked self-attention layers. We further integrate the Spatial-Frequency Transformer into the Siamese tracking framework and propose a novel tracking algorithm termed SFTransT. The cross-scale fusion based SwinTransformer is adopted as the backbone, and a multi-head cross-attention module is also used to boost the interaction between search and template features. The output is fed into the tracking head for target localization. Extensive experiments on both short-term and long-term tracking benchmarks all demonstrate the effectiveness of our proposed framework. Source code is released at https://github.com/Tchuanm/SFTransT.git.
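The attention step described in the abstract (prior injection, softmax, split into direct and high-frequency parts, rescaled recombination) can be sketched in NumPy as below. This is only an illustration under stated assumptions, not the authors' implementation: the fixed Gaussian centre and width stand in for the values the paper generates with dual MLPs, the prior is added to the similarity matrix in log space, the direct component is taken to be uniform averaging over all tokens, and the names `gaussian_log_prior`, `gpha_sketch`, `alpha`, and `beta` are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gaussian_log_prior(h, w, center, sigma):
    """Log of an unnormalised isotropic 2D Gaussian on an h x w grid.

    Assumption: center and sigma are fixed inputs here; the abstract says
    the prior is generated by dual MLPs, which this sketch does not model.
    """
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    return (-d2 / (2.0 * sigma ** 2)).reshape(-1)  # shape (h * w,)


def gpha_sketch(x, wq, wk, wv, log_prior, alpha=1.0, beta=2.0):
    """Single-head attention with a spatial prior and high-frequency emphasis.

    x: (n, c) flattened 2D features; wq, wk, wv: (c, d) projections.
    alpha and beta are hypothetical rescaling weights for the low-pass
    and high-pass branches.
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    n, d = q.shape
    sim = q @ k.T / np.sqrt(d)            # Query-Key similarity matrix
    sim = sim + log_prior[None, :]        # inject the prior over key positions
    attn = softmax(sim, axis=-1)
    low = np.full((n, n), 1.0 / n)        # direct component: uniform averaging
    high = attn - low                     # remaining high-frequency component
    out = (alpha * low + beta * high) @ v
    return out, attn


# Usage: a 4 x 4 feature map with 8 channels and random projections.
rng = np.random.default_rng(0)
h, w, c = 4, 4, 8
x = rng.normal(size=(h * w, c))
wq, wk, wv = (rng.normal(size=(c, c)) for _ in range(3))
log_prior = gaussian_log_prior(h, w, center=(1.5, 1.5), sigma=1.0)
out, attn = gpha_sketch(x, wq, wk, wv, log_prior)
print(out.shape, attn.shape)
```

With `alpha == beta == 1` the two branches sum back to ordinary softmax attention; choosing `beta > alpha` amplifies the high-frequency branch relative to the direct one.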
Similar resources
Learning Spatial-Aware Regressions for Visual Tracking
In this paper, we analyze the spatial information of deep features, and propose two complementary regressions for robust visual tracking. First, we propose a kernelized ridge regression model wherein the kernel value is defined as the weighted sum of similarity scores of all pairs of patches between two samples. We show that this model can be formulated as a neural network and thus can be effic...
Visual Learning in Multiple-Object Tracking
BACKGROUND Tracking moving objects in space is important for the maintenance of spatiotemporal continuity in everyday visual tasks. In the laboratory, this ability is tested using the Multiple Object Tracking (MOT) task, where participants track a subset of moving objects with attention over an extended period of time. The ability to track multiple objects with attention is severely limited. Re...
Convolutional Gating Network for Object Tracking
Object tracking through multiple cameras is a popular research topic in security and surveillance systems especially when human objects are the target. However, occlusion is one of the challenging problems for the tracking process. This paper proposes a multiple-camera-based cooperative tracking method to overcome the occlusion problem. The paper presents a new model for combining convolutiona...
Learning Object Intrinsic Structure for Robust Visual Tracking
In this paper, a novel method to learn the intrinsic object structure for robust visual tracking is proposed. The basic assumption is that the parameterized object state lies on a low dimensional manifold and can be learned from training data. Based on this assumption, firstly we derived the dimensionality reduction and density estimation algorithm for unsupervised learning of object intrinsic ...
Journal
Journal title: IEEE Transactions on Circuits and Systems for Video Technology
Year: 2023
ISSN: 1051-8215, 1558-2205
DOI: https://doi.org/10.1109/tcsvt.2023.3249468